Information-Theoretic Segmentation of Natural Language

Authors

  • Sascha S. Griffiths
  • Mariano Mora McGinity
  • Jamie Forth
  • Matthew Purver
  • Geraint A. Wiggins
Abstract

We present computational experiments on language segmentation using a general information-theoretic cognitive model. The method uses the statistical regularities of language to segment a continuous stream of symbols into “meaningful units” at a range of levels. Given a string of symbols (in the present approach, textual representations of phonemes), we attempt to find syllables such as grea and sy (in the word greasy); words such as in, greasy, wash, and water; and phrases such as in greasy wash water. The approach is entirely information-theoretic and requires no knowledge of the units themselves; it is thus assumed to require only general cognitive abilities, and has previously been applied to music. We tested our approach on two spoken language corpora, and we discuss our results in the context of learning as a statistical process.
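To make the idea of segmenting by statistical regularity concrete, here is a minimal sketch. It is not the model used in the paper: it assumes a plain bigram model with add-alpha smoothing and places a boundary before any symbol whose information content (surprisal) is a local peak. The function names, the smoothing constant, and the toy corpus are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def train_bigram(sequences, alpha=1.0):
    """Count symbol bigrams and return a function giving smoothed P(cur | prev)."""
    pair_counts = Counter()
    prev_counts = Counter()
    vocab = set()
    for seq in sequences:
        vocab.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1
    vocab_size = len(vocab)

    def prob(prev, cur):
        # Add-alpha smoothing (an assumption of this sketch) avoids zero probabilities.
        return (pair_counts[(prev, cur)] + alpha) / (prev_counts[prev] + alpha * vocab_size)

    return prob


def surprisal(seq, prob):
    """Information content -log2 P(cur | prev) for every symbol after the first."""
    return [-math.log2(prob(prev, cur)) for prev, cur in zip(seq, seq[1:])]


def segment(seq, prob):
    """Cut before each symbol whose surprisal is a strict local peak."""
    ic = surprisal(seq, prob)  # ic[i] is the surprisal of seq[i + 1]
    cuts = [i + 1 for i in range(1, len(ic) - 1)
            if ic[i] > ic[i - 1] and ic[i] > ic[i + 1]]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


# Toy unsegmented corpus (hypothetical data, letters standing in for phonemes).
corpus = ["thedogsawthecat", "thecatsawthedog", "adogsawacat"]
model = train_bigram(corpus)
print(segment("thedogsawthecat", model))
```

The sketch only looks one symbol back and needs no list of known words or syllables; a model with longer context, or one that also tracks entropy, would be a natural refinement.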


Similar resources

Plant Classification in Images of Natural Scenes Using Segmentations Fusion

This paper presents a novel approach to automatically classifying and identifying tree leaves using image segmentation fusion. With the development of mobile devices and remote access, automatic plant identification in images taken in natural scenes has received much attention. Image segmentation plays a key role in most plant identification methods, especially in images with complex backgrounds. Wher...


Text classification in Asian languages without word segmentation

We present a simple approach for Asian language text classification without word segmentation, based on statistical n-gram language modeling. In particular, we examine Chinese and Japanese text classification. With character n-gram models, our approach avoids word segmentation. However, unlike traditional ad hoc n-gram models, the statistical language modeling based approach has strong information...


Word Boundary Information and Chinese Word Segmentation

Chinese word segmentation could be considered as a problem of word boundary recognition. Word boundary information plays a significant role in human language acquisition and automatic segmentation for Natural Language Processing (NLP). Extraction of word boundary information involves cognitive psychology, computational linguistics, and language education. Methods utilizing word boundary informa...


Topic Segmentation and Labeling in Asynchronous Conversations

Topic segmentation and labeling is often considered a prerequisite for higher-level conversation analysis and has been shown to be useful in many Natural Language Processing (NLP) applications. We present two new corpora of email and blog conversations annotated with topics, and evaluate annotator reliability for the segmentation and labeling tasks in these asynchronous conversations. We propos...


An information theoretic approach for using word cluster information in natural language call routing

In this paper, an information theoretic approach for using word clusters in natural language call routing (NLCR) is proposed. This approach utilizes an automatic word class clustering algorithm to generate word classes from the word based training corpus. In our approach, the information gain (IG) based term selection is used to combine both word term and word class information in NLCR. A joint...




Publication date: 2015